CytoSignal detects cell-cell signaling from spatial transcriptomics data at single-cell resolution. CytoSignal performs a nonparametric statistical test to identify which cells within a tissue have significant activity for a particular signaling interaction. CytoSignal considers multi-component interactions and separately models interactions mediated by diffusion vs. contact-dependent molecules.
In this tutorial, we demonstrate how to preform CytoSignal on a mouse embryonic E14 brain data captured by Slide-tags, a spatial transcriptomic protocol at single-cell resolution. This data was originally produced in the work of Andrew J. C. Russell et al., 2023, and is now publicly available on Single-Cell Portal via SCP2170. For convenience, we have pre-processed the data so it can be easily loaded into user R environment, which is available at figshare.
Download the files listed above from the provided link into a desired local directory.
Next, load the downloaded data into the R environment using the
following codes. We assume that users have the files placed at the
current working directory (as can be shown by getwd()).
Alternatively, users can also specify the path to the files.
## The RDS file will be loaded into a ready-to-use object
dge <- readRDS("SCP2170_annotated_dgCMatrix.rds")
## The cluster annotation need to be presented as a factor object
cluster <- read.csv("SCP2170_cluster.csv")
cluster <- factor(cluster$cell_type)
names(cluster) <- colnames(dge)
## The spatial coordinates need to be presented as a matrix object
spatial <- as.matrix(read.csv("SCP2170_spatial.csv", row.names = 1))
## Please make sure that the dimension names are lower case "x" and "y"
colnames(spatial) <- c("x", "y")
Then a CytoSignal object can be created with the following command.
library(cytosignal)
cs <- createCytoSignal(raw.data = dge, cells.loc = spatial, clusters = cluster)
CytoSignal has a built-in ligand-receptor interaction database resorted from CellphoneDB, which can be simply loaded into the object running the following command.
cs <- addIntrDB(cs, g_to_u, db.diff, db.cont, inter.index)
Next, remove low-quality cells and genes, retain only those genes available in the interaction database, and convert gene names to Uniprot IDs.
cs <- removeLowQuality(cs, counts.thresh = 300)
cs <- changeUniprot(cs)
The spatially resolved interaction scores of each interaction in each location (LRscore) is defined as the co-expression of ligand and receptor genes within close spatial proximity. The computation can be divided into three main steps: 1) defining spatial neighbors for each location; 2) calculating the amount of ligand (\(L\)) and receptor (\(R\)) each location can receive from their spatial neighbors; 3) calculating ligand-receptor co-expression within each spatial neighborhood; 4) performing spatial permutation test to infer significant interactions.
CytoSignal defines spatial neighborhoods for diffusion-dependent and contact-dependent interactions differently. We use the Epsilon ball approach for diffusible interactions and Delaunay Triangulation for contact-dependent interactions.
For diffusion-dependent interactions, for each location \(i\), we define its spatial neighbors as all locations \(j\) within a circle centered on location \(i\) with a predefined radius \(r\) (200 µm by default). We next weight the amount of \(L\) that \(i\) receives based on the physical distance between \(i\) and \(j\) transformed by a Gaussian kernel. For determining the parameters of this kernel, we will need a scaling factor between the arbitrary units of the spatial coordinates and real unit (such as µm), which is based on prior knowledge of user dataset. For the specific example Slide-tags data, the conversion ratio is 0.73.
cs <- inferEpsParams(cs, scale.factor = 0.73)
It’s recommended to review the inferred parameters for a sanity check.
cs@parameters$r.diffuse.scale
## [1] 273.9726
cs@parameters$sigma.scale
## [1] 73.70953
For each location, we can then use findNN() to find its
spatial neighborhood and then calculate the weights between each it and
its neighbors. The results for diffusible interactions are saved with
model name GauEps and contact-dependent interactions are
saved with model name DT.
cs <- findNN(cs)
The next step is to calculate the amount of ligand (\(L\)) and receptor (\(R\)) each location can receive from their
spatial neighbors using function imputeLR.
cs <- imputeLR(cs)
For calculating the LRscore of each interaction within each spatial neighborhood, we multiply \(L\) and \(R\) within each location and apply an average within its DT neighborhood. Next, we perform a spatial permutation test, calculate the null distribution of the imputed ligands and receptors, and calculate a one-sided p-value. To control for multiple hypothesis testing and potential biases caused by cellular density differences, we further perform spatial false discovery rate correction.
The output of CytoSignal is a test statistic \(S\) and adjusted p-value for each signaling
interaction within each location in the tissue. Finally, cells with
significant signaling activity can be identified by setting a
significance level such as p.value = 0.05. For convenience,
we rank interactions by either the number of significant spatial
positions or the spatial variability statistics of the LRscore
calculated by SPARK-X.
The function inferIntrScore is an integrated function
that performs the LRscore calculation and permutation tests. For the
calculation of LRscore, by default, receptors are taken from the
normalized expression. Ligands of diffusible interactions are imputed
with Gaussian Epsilon ball (GauEps-Raw), while those from
contact-dependent interactions are imputed with Delaunay Triangulation
(DT-Raw). The function also provides an option
recep.smooth to smooth the receptor expressions with
Delaunay Triangulation (GauEps-DT and DT-DT),
which is useful when the sparsity of the data is relatively high. The
function inferSignif() should be called subsequently to
identify significant interactions. For ranking the interactions by
spatial variability statistics, we provide
rankIntrSpatialVar().
cs <- inferIntrScore(cs)
cs <- inferSignif(cs, p.value = 0.05, reads.thresh = 100, sig.thresh = 100)
cs <- rankIntrSpatialVar(cs)
reads.thresh is the minimum number of reads for a
ligand-receptor interaction to be considered. sig.thresh is
the minimum number of beads for a ligand-receptor interaction to be
considered. The three thresholds can be changed arbitrarily if the
number of the significant beads is too large or too small. The function
inferSignif() by default is applied to all LRscore types
available. When any thresholds need to be tweaked for a specific LRscore
type, users can specify the slot.use argument
explicitly.
Users can use function showIntr() to view all
significant interactions.
The argument slot.use is for specifying the LRscore
calculation model used. Possible options are explained as followed:
"GauEps-Raw": Ligand expressions are imputed with
Gaussian Epsilon ball and receptor expressions are taken from normalized
expression. Typically for the diffusion-dependent interactions."DT-Raw": Ligand expressions are imputed with Delaunay
Triangulation and receptor values are taken from raw expression.
Typically for the contact-dependent interactions."GauEps-DT": Ligand expressions are imputed with
Gaussian Epsilon ball and receptor values are imputed with Delaunay
Triangulation. This is for diffusion-dependent interactions, but exists
only when inferIntrScore() is run with
recep.smooth = TRUE."DT-DT": Ligand values and receptor values are all
imputed with Delaunay Triangulation. This is for contact-dependent
interactions, but exists only when inferIntrScore() is run
with recep.smooth = TRUE.The argument signif.use can return interactions that are
ranked by different metrics:
"result": have p-value less that specified
threshold."result.hq": have significant p-value and passes the
quality control on the minimum number of reads and beads as mentioned
above."result.spx": are spatially variable and are of high
quality according to that in "result.hq".Setting return.name = TRUE displays both interaction
unique IDs and the interaction names for easier interpretation.
Interaction names are shown as a “ligand-receptor” gene symbol.
allIntrs <- showIntr(cs, slot.use = "GauEps-Raw", signif.use = "result.spx", return.name = TRUE)
print(head(allIntrs))
## CPI-SS0659DBE72 CPI-SS04124F4E1 CPI-SS0008137B6 CPI-SS00F4DDF4B CPI-SS0E063192D
## "EFNA4-EPHA4" "EPO-EFNB2" "EFNA4-EPHA5" "PTN-PTPRS" "NRG2-ERBB4"
## CPI-SS0C694CB44
## "VEGFA-NMDE2"
CytoSignal makes cellular-resolution, spatially-resolved signaling inference. We developed several new visualizations for plotting each inferred signaling interaction.
Most of the functions take two arguments, intr and
slot.use, for specifying individual interactions. Users can
first show a list of significant interactions that is available for
visualization with showIntr(). intr should
then be the unique ID of available interaction(s) and
slot.use should be a selection as explained above.
We developed a 3D edge plot for visualizing the signal-sending and signal-receiving cells of an interaction at single-cell resolution. The plot comes with two layers of scatter plot of cells placed at the top and bottom of a 3D box space, for labeling the receivers and senders, respectively. In each layer, we use low transparency to highlight the cells that is sending or receiving the signal of the specified interaction. Cells are colored by cluster at the mean time. Finally, we bring lines that connect the senders with their corresponding tentative receivers, which are the edges.
intr.use <- names(allIntrs)[1]
plotEdge(cs, intr.use, slot.use = "GauEps-Raw", pt.size = 0.15)
To plot the cluster annotation legend alone, users can simply use
plotCluster() for reference.
plotCluster(cs)
We provide a function named plotSignif2 for making a
general combination figure. This function plots on a per-interaction
basis, and for each interaction it shows 1) the imputed gene expression
of the ligand and receptor; 2) the original raw expression values of the
ligand and receptor; 3) the inferred LRscore; 4) cluster annotations for
each location; 5) 3D edge plot for visualizing the signal-sending and
signal-receiving cells. General graphical setting can be found in the
documentation of the function (?plotSignif2).
plotSignif2(cs, intr = intr.use, slot.use = "GauEps-Raw", return.plot = TRUE, edge = TRUE, pt.size = 0.2)
Recommended: When plotting a large number of
interactions, it’s recommended to use numeric index with
intr and turn to return.plot = FALSE (by
default). In this way, plots of all input interactions at high
resolution will be stored on the device instead of on the screen.
The codes below show the top 5 significant interactions ranked by SPARK-X for both diffusion-dependent and contact-dependent interactions.
# For diffusion dependent interactions
signif.use <- "result.spx"
lrscore.slot <- "GauEps-Raw"
plot_dir <- paste0("path_to_result/diff-dep_", signif.use, "_", lrscore.slot, "/")
plotSignif2(cs, intr = 1:5, slot.use = lrscore.slot, signif.use = signif.use, plot_dir = plot_dir)
# For contact dependent interactions
## Choose a level of significance metric
lrscore.slot <- "DT-Raw"
plot_dir <- paste0("path_to_result/cont-dep_", signif.use, "_", lrscore.slot, "/")
plotSignif2(cs, intr = 1:5, slot.use = lrscore.slot, signif.use = signif.use, plot_dir = plot_dir)
We provide a few downstream analysis approaches to further investigate the biological significance from interactions identified by CytoSignal.
With function inferIntrDEG(), we can identify genes that
are differentially expressed in the cells that are highly correlated to
an interaction. After obtaining a gene set for each interaction, we
further fit it to a glmnet model
to do a regression analysis. We finally examine the coefficients of the
model to determine the significant genes for the interaction.
The returned object is a list where each element is a result list for
each interaction. The sub-entry
intrDEG[[intr.use]]$sign_genes is the character vector
containing the genes that are significant for the interaction.
intrDEG <- inferIntrDEG(cs, intr = intr.use, slot.use = "GauEps-Raw", signif.use = "result.spx")
intrDEG[[1]]$sign_genes
## [1] "Gm29260" "Erbb4" "Epha4" "Zeb2"
## [5] "Neb" "Mpped2" "Meis2" "Efna4"
## [9] "Mob3b" "Mki67" "Gm45680" "B020031H02Rik"
## [13] "Sox1ot" "Hmgb2" "Nfix" "Cdon"
## [17] "Gm30373" "Nrxn3" "Gli3" "Gas1"
## [21] "Dleu2" "Gm27017" "Celsr1" "Zbtb20"
## [25] "Tcf4" "Gldc" "Emx2os" "Sytl5"
With the gene set, we can further do a GO term enrichment analysis. Technically this can be done with any GO tools of preference. We used GORILLA in our manuscript. Here we provide an example of using the R package gprofiler2.
# Need to set `evcodes = TRUE` to get list genes that hits each GO term
goRes <- gprofiler2::gost(intrDEG[[1]]$sign_genes,
organism = "mmusculus", evcodes = TRUE)
We provide one more function heatmap_GO() for showing
the the GO term by the genes submitted. The heatmap will be colored by
the coefficient value that is calculated from the regression analysis,
while white-colored grids indicate none intersection between the
corresponding GO terms.
With GO analysis conducted with other tools, we require at least three columns from the result table for this function to work:
description.col and the content will be shown as the row
labels of the heatmap.pval.col."gene1, gene2, gene3" so we can identify the hit. The
name of this column need to be set to argument gene.col.
This information could have been formatted variously by different tools,
users might need to further provide a splitting function that splits the
string into a character vector of the gene names to argument
gene.split.fun.heatmap_GO(intrDEG, goRes$result, intr = intr.use,
description.col = "term_name", pval.col = "p_value",
gene.col = "intersection")
In our manuscript, we used another web-based tool REVIGO to visualize the GO terms. Users
can simply copy the term IDs from the result into the web form and view
the visualization interactively in browser. For users with limited
access to interactive session (e.g. on HPCs), we provided function
revigo() to programmatically submit query and fetch the
result from their server. The queried result allows us generating a
scatter plot where each point represents a GO term and more semantically
similar GO terms are also closer in the plot. The color scale presents
the value when we submit the query, which is the log
transformed p-value in this example. The size of dots indicates the
frequency of the GO term in the underlying GOA database, so that bubbles
of more general terms are larger.
revigoRes <- revigo(GOterms = goRes$result$term_id,
value = goRes$result$p_value, valueType = "PValue",
speciesTaxon = "10090")
plotREVIGO(revigoRes, labelSize = 3)